Abstract

In statistical practice, whether a Bayesian or frequentist approach is used in 
inference depends not only on the availability of prior information but also on 
the attitude taken toward partial prior information, with frequentists tending to 
be more cautious than Bayesians. The proposed framework defines that attitude 
in terms of a specified amount of caution, thereby enabling data analysis at the
level of caution desired and on the basis of any prior information. The caution
parameter represents the attitude toward partial prior information in much the 
same way as a loss function represents the attitude toward risk. When there 
is very little prior information and nonzero caution, the resulting inferences 
correspond to those of the candidate confidence intervals and p-values that are 
most similar to the credible intervals and hypothesis probabilities of the specified 
Bayesian posterior. On the other hand, in the presence of a known physical 
distribution of the parameter, inferences are based only on the corresponding 
physical posterior. In those extremes of either negligible prior information or 
complete prior information, inferences do not depend on the degree of caution. 
Partial prior information between those two extremes leads to intermediate 
inferences that are more frequentistic to the extent that the caution is high and 
more Bayesian to the extent that the caution is low. 

Keywords: ambiguity; blended inference; conditional Gamma-minimax; confidence 
distribution; confidence posterior; Ellsberg paradox; imprecise probability; maximum 
entropy; maxmin expected utility; minimum cross entropy; minimum divergence; minimum
information for discrimination; minimum relative entropy; observed confidence
level; robust Bayesian analysis 



1 Introduction 



The controversy between Bayesianism and frequentism may be irresolvable to the 
extent that it reflects honest differences in personal attitudes of statisticians rather 
than differences in their knowledge or rationality. As Efron (2005) pointed out,



The Bayesian-frequentist debate reflects two different attitudes about the 
process of doing science, both quite legitimate. Bayesian statistics is well 
suited to individual researchers, or a research group, trying to use all of 
the information at its disposal to make the quickest possible progress. In
pursuing progress, Bayesians tend to be aggressive and optimistic with
their modeling assumptions. Frequentist statisticians are more cautious 
and defensive. One definition says that a frequentist is a Bayesian trying 
to do well, or at least not too badly, against any possible prior distribution. 
The frequentist aims for universally acceptable conclusions, ones that will 
stand up to adversarial scrutiny. 

On one hand, methodology reflecting extreme caution in the form of the minimax-like 
attitude attributed to frequentists and, on the other hand, methodology reflecting the 
extreme reliance on modeling assumptions attributed to Bayesians both play useful 
roles in statistical inference. Building on that premise, the idea motivating this paper 
is that methodology for moderate amounts of caution also has a place in practical 
data analysis. The extent of such caution will be formally defined in order to facilitate 
making statistical inferences at the level of caution appropriate to the situation. 

The formal definition will build on previous work to formalize caution in the
face of uncertainty. Attitudes toward uncertainty have long been mathematically
modeled in the economics literature. Ellsberg (1961) identified two distinct types
of uncertainty: risk is the variability in an unknown quantity that threatens assets,
whereas ambiguity is ignorance about the extent of such variability. The same agent
may be much more cautious toward risk than toward ambiguity or vice versa. A
utility or loss function can model an agent's attitude toward risk but not its attitude
toward ambiguity. Because frequentist actions can differ from Bayesian actions given
the same loss function, the ambiguity attitude is much more pertinent than the risk
attitude to the concept of caution needed to represent and balance the two basic
approaches to statistical inference.

Ellsberg (1961) distinguished "pessimism" from "conservatism": the former is an
excessive belief that worst-case scenarios will materialize, whereas the latter only
involves cautiously acting as if they will. In other words, the attitude of "hoping for
the best, preparing for the worst" is consistent with conservatism but not pessimism.
While that attitude does motivate much of frequentist statistics, "conservatism"
already has technical meanings in the statistics literature, e.g., conservative confidence
intervals have higher-than-nominal coverage rates. For that reason, the term "caution"
will be used when assigning an operational definition to the degree of conservatism
toward ambiguity in the sense of Ellsberg (1961).

Table 1: Utility function for actions I-II and the three possible states of nature.

Table 2: Utility function for actions III-IV and the three possible states of nature.



In an example from Ellsberg (1961), a ball is randomly drawn from an urn of 90
balls, each of one of three possible colors: red, black, and yellow. Nothing is known
about the distribution of the balls in the urn except that exactly 30 are red. Thus,
there is ambiguity in the distribution of black and yellow balls. The agent would gain
a reward of $0 or $100 based on its taking action I or action II according to the utility
function displayed as Table 1 in setting 1. In setting 2, the agent would instead gain
$0 or $100 based on its taking action III or action IV according to the utility function
displayed as Table 2. Agents cautious toward ambiguity would choose action I over
action II in setting 1 but would take action IV over action III in setting 2, against
subjective Bayesian concepts of coherence but without requiring the extreme caution
of a minimax strategy (Ellsberg, 1961).
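The urn example can be checked numerically. The sketch below is only an illustration under a stated assumption: the payoffs are taken to be the standard three-color bets (I pays on red, II on black, III on red or yellow, IV on black or yellow), since the cell values of Tables 1 and 2 are not reproduced in the text. It also uses the extreme maxmin criterion, which is more caution than the argument requires but is enough to reproduce the stated preferences.

```python
from fractions import Fraction

# ASSUMPTION: standard Ellsberg three-color payoffs; the table cells are
# not available in the text, so these values are illustrative only.
PAYOFF = {
    "I":   {"red": 100, "black": 0,   "yellow": 0},
    "II":  {"red": 0,   "black": 100, "yellow": 0},
    "III": {"red": 100, "black": 0,   "yellow": 100},
    "IV":  {"red": 0,   "black": 100, "yellow": 100},
}

def urn_compositions(total=90, red=30):
    """Yield every urn with exactly `red` red balls; the black count is unknown."""
    for black in range(total - red + 1):
        yield {"red": Fraction(red, total),
               "black": Fraction(black, total),
               "yellow": Fraction(total - red - black, total)}

def worst_case_utility(action):
    """Expected utility minimized over all admissible urns (maxmin criterion)."""
    return min(sum(p[color] * PAYOFF[action][color] for color in p)
               for p in urn_compositions())

worst = {action: worst_case_utility(action) for action in PAYOFF}
```

Under these assumed payoffs, the worst-case expected utility favors action I over action II and action IV over action III, which no single probability for black can rationalize under expected utility.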



In the absence of ambiguity, the axiomatic system of von Neumann and Morgenstern
(1953, §3.6) and later generalizations prescribe choosing the action that maximizes
expected utility. By forcefully applying such a system to conditional expectations
given observed data, Savage (1954) revitalized Bayesian statistics. The action that
maximizes expected utility with respect to a Bayesian posterior is called the posterior
Bayes action. Ambiguity about the posterior is usually modeled in terms of a set
$\dot{\mathcal P}$ of multiple posteriors in place of a single posterior. A multiplicity of posteriors
may arise from insufficient elicitation of subjective prior opinions (e.g., Berger et al.,
2000), from a spread in a gamble's buying and selling prices (e.g., Walley, 1991),
or, more objectively, from ignorance as to which prior distribution in a set describes
the physical variability of a parameter. The last source accords best with the notion
of ambiguity as used in Ellsberg (1961), Jaffray (1989b), Jaffray (1989a), and
Gajdos et al. (2004).



In the Bayesian statistics literature, the most studied decision-theoretic approach
for sets of priors is the (marginal) Γ-minimax strategy (e.g., Berger et al., 2000), which
formulates the problem in terms of minimax risk in the frequentist sense of Wald
(1961). The closely related conditional Γ-minimax strategy (e.g., Betro and Ruggeri,
1992) takes the action that minimizes the expected loss maximized over all of the
posterior distributions in a set $\dot{\mathcal P}$, each member of which corresponds to a prior
distribution in a set traditionally denoted by Γ. That statistical strategy is a special
case of the maxmin expected utility strategy (Hurwicz, 1951b; Gilboa and Schmeidler,
1989), which takes the action that maximizes the expected utility minimized over a set
of distributions. Both "robust Bayesian" strategies are reviewed in Vidakovic (2000).

The following equation extends the conditional Γ-minimax strategy to the problem
of conducting statistical inference at a specified degree of caution $\kappa$ and with respect
to a Bayesian posterior $\dot P \in \dot{\mathcal P}$ that is not generally the true physical distribution of
the parameter $\theta$. For any $\kappa \in [0, 1]$, the $\kappa$-conditional-Γ (κCΓ) action is defined as

$$\dot a_\kappa = \arg\inf_{a \in \mathcal A} \left( \kappa \sup_{P' \in \dot{\mathcal P}} \int L(\theta, a)\, dP'(\theta) + (1 - \kappa) \int L(\theta, a)\, d\dot P(\theta) \right), \tag{1}$$

with the conventions that $\kappa \times \infty = 0$ if $\kappa = 0$ and $(1 - \kappa) \times \infty = 0$ if $\kappa = 1$. The κCΓ
action reduces to the conditional Γ-minimax action under complete caution ($\kappa = 1$)
and to the posterior Bayes action in the complete absence of caution ($\kappa = 0$). For
discrete $\theta$, this $\kappa$ is isomorphic to quantities used by Ellsberg (1961), Hurwicz (1951a),
Gajdos et al. (2004), and Tapking (2004) and is similar in spirit to the quantity in
Jaffray (1989b) that Augustin (2002) calls "caution." Gajdos et al. (2004) stressed
the equivalent of the rearrangement of equation (1) as

$$\dot a_\kappa = \arg\inf_{a \in \mathcal A} \; \sup_{P \in \{\kappa P' + (1 - \kappa) \dot P \,:\, P' \in \dot{\mathcal P}\}} \int L(\theta, a)\, dP(\theta). \tag{2}$$
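For a finite parameter space, a finite set of plausible posteriors, and a finite action set, the two forms of the cautious action can be compared directly. The following is a minimal sketch, not the author's implementation; the loss matrix, the two plausible posteriors, and the working posterior are hypothetical values chosen only for illustration.

```python
import numpy as np

def kcg_action(loss, plausible, working, kappa):
    """Index of the action minimizing
    kappa * max_{P'} E_{P'}[L] + (1 - kappa) * E_working[L].

    loss:      (n_theta, n_actions) array of losses L(theta, a)
    plausible: (n_post, n_theta) array, one plausible posterior per row
    working:   (n_theta,) working Bayesian posterior
    """
    worst = (plausible @ loss).max(axis=0)  # sup over plausible posteriors
    bayes = working @ loss                  # expected loss under the working posterior
    return int(np.argmin(kappa * worst + (1 - kappa) * bayes))

def kcg_action_mixture_form(loss, plausible, working, kappa):
    """Same action from the rearranged form: sup over the mixtures
    kappa * P' + (1 - kappa) * working."""
    mixtures = kappa * plausible + (1 - kappa) * working
    return int(np.argmin((mixtures @ loss).max(axis=0)))

# Hypothetical example: theta in {0, 1}; actions "bet on 0", "bet on 1", "hedge".
loss = np.array([[0.0, 1.0, 0.4],
                 [1.0, 0.0, 0.4]])
plausible = np.array([[0.9, 0.1],
                      [0.2, 0.8]])
working = np.array([0.9, 0.1])

actions = {k: kcg_action(loss, plausible, working, k) for k in (0.0, 0.2, 0.5, 1.0)}
```

In this toy setting the action moves from the posterior Bayes choice to the hedge as the caution grows, and both forms select the same action because the working-posterior term is constant across the supremum.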



The κCΓ strategy has two drawbacks that will prevent its use in many applications.
First, under standard loss functions, the conditional Γ-minimax (1CΓ) strategy
requires either that $\dot{\mathcal P}$ impose strict bounds on the parameter space (Abdollah Bayati
and Parsian, 2011) or that $\mathcal A$ be severely restricted (Betro and Ruggeri, 1992), and the
κCΓ strategy with $0 < \kappa < 1$ has the same limitation. Second, since the 1CΓ strategy is not
necessarily a frequentist procedure, the κCΓ framework does not fulfill the above
goal of formulating procedures that reduce to frequentist procedures given complete
caution.

Following the preliminary notation and definitions of Section 2, an information-theoretic
framework will be introduced in Section 3 to overcome the identified limitations
of the κCΓ framework. Simple examples demonstrating the wide applicability
of the information-theoretic framework will appear in Section 4. Further generality
will be carried out in Section 5 by exchanging roles of frequentist and Bayesian procedures
and by noting the particular applications that call for each role exchange.
Section 6 closes the paper with a brief discussion.

2 Bayesian and frequentist posterior distributions

2.1 Preliminary concepts

The observed data vector $x \in \mathcal X$ is modeled as a realization of a random variable $X$ of
probability space $(\mathcal X, \mathfrak X, P_{\theta_*, \lambda_*})$, which for some parameter set $\Theta_* \times \Lambda_*$ is indexed by
an interest parameter $\theta_* \in \Theta_*$ and potentially also by a nuisance parameter $\lambda_* \in \Lambda_*$.
The family $\{P_{\theta_*, \lambda_*} : \theta_* \in \Theta_*, \lambda_* \in \Lambda_*\}$ will be called the sampling model for $x$.

Inferences will be made about the focus parameter $\theta = \theta(\theta_*)$, a subparameter of
the interest parameter, in a set $\Theta$. In the simplest case, $\theta = \theta_*$ and $\Theta = \Theta_*$, but
there are many other possibilities. For example, when testing the null hypothesis that
$\theta_* = 0$ against the alternative hypothesis that $\theta_* \neq 0$ for $\Theta_* = \mathbb R$, it is convenient
to define the focus parameter by $\theta = 0$ if $\theta_* = 0$ and $\theta = 1$ if $\theta_* \neq 0$, in which case
$\Theta = \{0, 1\}$. Let $\mathcal H$ denote a σ-field that allows any physically meaningful hypothesis
about $\theta$ to be expressed as "$\theta$ is in $\Theta^\dagger$," where $\Theta^\dagger \in \mathcal H$.

2.2 Bayesian posteriors 

In the Bayesian setting, the above sampling model is understood as conditional
on the parameter values with respect to some prior distribution, as follows. Every
member $P_*^{\text{prior}}$ of some set $\mathcal P_*^{\text{prior}}$ is a distribution such that there is a random
triple $(X, \vartheta_*, \ell_*) \sim P_*^{\text{prior}}$ and such that
$P_{\theta_*, \lambda_*} = P_*^{\text{prior}}(\bullet \mid \vartheta_* = \theta_*, \ell_* = \lambda_*)$ for all
$\theta_* \in \Theta_*$, $\lambda_* \in \Lambda_*$.

Let $\mathcal P$ denote the set of all probability distributions on $(\Theta, \mathcal H)$. Before observing
data, knowledge about the focus parameter is represented by $\mathcal P^{\text{prior}}$, the set of all
distributions of $(X, \theta(\vartheta_*))$ for all $P_*^{\text{prior}} \in \mathcal P_*^{\text{prior}}$:

$$\mathcal P^{\text{prior}} = \left\{ P^{\text{prior}} : (X, \theta(\vartheta_*)) \sim P^{\text{prior}},\ (X, \vartheta_*, \ell_*) \sim P_*^{\text{prior}} \in \mathcal P_*^{\text{prior}} \right\}.$$

The marginal distribution of each $\theta(\vartheta_*)$ is in $\mathcal P$ and is called a plausible prior
distribution since it is consistent with pre-data knowledge.

The Bayesian approach yields inferences about the focus parameter on the basis
of a single distribution, $\dot P^{\text{prior}} \in \mathcal P^{\text{prior}}$. If $(X, \vartheta) \sim \dot P^{\text{prior}}$, then the working prior
distribution is the marginal distribution of $\vartheta$. It follows that the working prior is one
of the plausible priors.

The working Bayesian posterior $\dot P$ and the knowledge base $\dot{\mathcal P}$ (Topsøe, 2004) are
defined such that

$$\dot P = \dot P^{\text{prior}}(\bullet \mid X = x); \qquad \dot{\mathcal P} = \left\{ P'(\bullet \mid X = x) : P' \in \mathcal P^{\text{prior}} \right\}.$$

$\dot P$ is simply the Bayesian posterior distribution corresponding to the working prior,
and $\dot{\mathcal P}$ is likewise the set of Bayesian posteriors in $\mathcal P$ that correspond to plausible
prior distributions. To prevent confusion with $\dot P$, members of $\dot{\mathcal P}$ will be referred to
as plausible posteriors since they are the parameter distributions consistent with the
mathematical representation either of a physical system or of a belief system (cf.
Topsøe, 1979, 2004). Thus, the posterior that would be used in purely Bayesian
inference is one of the plausible posteriors ($\dot P \in \dot{\mathcal P}$).

2.3 Confidence posteriors

The sampling model of Section 2.1 admits not only system constraints and Bayesian
inference (§2.2) but also frequentist inference in the form of confidence intervals and
p-values. Let $\mathcal H_*$ denote $\mathcal B(\Theta_*)$, the Borel σ-field of $\Theta_*$. Given some $\alpha \in [0, 1]$, if the
function $\hat\Theta(1 - \alpha; \bullet) : \mathcal X \to \mathcal H_*$ satisfies

$$P_{\theta_*, \lambda_*}\left( \theta_* \in \hat\Theta(1 - \alpha; X) \right) = 1 - \alpha \tag{3}$$

for all $\theta_* \in \Theta_*$ and $\lambda_* \in \Lambda_*$, then $\hat\Theta(1 - \alpha; X)$ is called a $100(1 - \alpha)\%$ confidence
set for $\theta_*$. Suppose $\hat\Theta : [0, 1] \times \mathcal X \to \mathcal H_*$ defines nested confidence sets in the sense
that $\hat\Theta(1 - \alpha; X)$ is a $100(1 - \alpha)\%$ confidence set for $\theta_*$ given any confidence level
$1 - \alpha \in [0, 1]$ and such that

$$0 \le 1 - \alpha_1 < 1 - \alpha_2 \le 1 \implies \hat\Theta(1 - \alpha_1; X) \subseteq \hat\Theta(1 - \alpha_2; X)$$

almost surely. A confidence posterior distribution is a distribution $\ddot P_*$ on $(\Theta_*, \mathcal H_*)$ for
which

$$\ddot P_*(\Theta_*^\dagger) = \ddot P_*(\vartheta_* \in \Theta_*^\dagger) \in \left\{ 1 - \alpha : \hat\Theta(1 - \alpha; x) = \Theta_*^\dagger,\ 0 \le \alpha \le 1 \right\} \tag{4}$$

for all $x \in \mathcal X$ and $\Theta_*^\dagger \in \mathcal H_*$, where $\vartheta_*$ is a random variable of distribution $\ddot P_*$.
Polansky (2007) called $\ddot P_*(\Theta_*^\dagger)$ the observed confidence level of the hypothesis that
$\theta_* \in \Theta_*^\dagger$. Confidence posteriors for which $\theta_*$ is a real scalar ($\Theta_* \subseteq \mathbb R$) and the σ-field
is Borel ($\mathcal H_* = \mathcal B(\Theta_*)$) are usually called confidence distributions, each of which
encodes confidence intervals of all confidence levels and hypothesis tests of all simple
null hypotheses (Efron, 1993). Various devices extend confidence posteriors to
cases in which their posterior probabilities only approximately match confidence levels
(Schweder and Hjort, 2002; Singh et al., 2005; Polansky, 2007; Bickel, 2011b).



The identity between confidence posterior probabilities and levels of confidence
(4) clears up the misunderstanding that confidence levels and p-values cannot be
interpreted as epistemological probabilities of hypotheses given the observed data.
In fact, since $\ddot P_*$ is a Kolmogorov probability measure on parameter space, decisions
made using various loss functions by the confidence posterior action

$$\ddot a = \arg\inf_{a \in \mathcal A} \int L(\theta_*, a)\, d\ddot P_*(\theta_*)$$

for each loss function $L$ are coherent with each other in the senses usually associated
with Bayesian inference, whether or not $\ddot P_*$ can be derived from some prior via Bayes's
theorem (Bickel, 2011c).



Let $\ddot{\mathcal P}_*$ denote the set of confidence posteriors on $(\Theta_*, \mathcal H_*)$ that are under consideration.
For example, $\ddot{\mathcal P}_*$ could be the set of a single confidence posterior, the set of
all distributions on $(\Theta_*, \mathcal H_*)$ that satisfy equation (3), or, as in Bickel (2011b), the
set of two approximate confidence posteriors or the convex set of all mixtures of the
two.

The set $\ddot{\mathcal P}$ will represent the set of distributions of $\theta(\vartheta_*)$ for all $\ddot P_* \in \ddot{\mathcal P}_*$:

$$\ddot{\mathcal P} = \left\{ \ddot P \in \mathcal P : \theta(\vartheta_*) \sim \ddot P,\ \vartheta_* \sim \ddot P_* \in \ddot{\mathcal P}_* \right\}.$$

Thus, for any $\ddot P_* \in \ddot{\mathcal P}_*$, there is a random parameter $\vartheta = \theta(\vartheta_*)$ of distribution
$\ddot P \in \ddot{\mathcal P}$ such that $\ddot P(\vartheta \in \{\theta(\theta_*) : \theta_* \in \Theta_*^\dagger\}) = \ddot P_*(\vartheta_* \in \Theta_*^\dagger)$ for all $\Theta_*^\dagger \in \mathcal H_*$. $\ddot{\mathcal P}$ will
be considered as a set of confidence posterior distributions of the focus parameter
even though more literally they are not necessarily confidence posteriors but rather
fiducial-like distributions derived from the set $\ddot{\mathcal P}_*$ of confidence posteriors by the laws
of probability. (Hannig (2009) provides a recent review of fiducial inference.) In
the simplest case of $\theta = \theta_*$, $(\Theta, \mathcal H) = (\Theta_*, \mathcal H_*)$ and $\ddot{\mathcal P} = \ddot{\mathcal P}_*$. While confidence
distributions are used here for concreteness, $\ddot{\mathcal P}$ can be a set of any distributions on
$(\Theta, \mathcal H)$ to use as benchmarks with respect to which the posterior introduced in the
next section is intended as an improvement.

3 Framework of moderate inference

3.1 Moderate posteriors

Let $P$ and $Q$ denote probability distributions on $(\Theta, \mathcal H)$. The information divergence
of $P$ with respect to $Q$ is defined as

$$I(P \| Q) = \int dP \log\left( \frac{dP}{dQ} \right) \tag{5}$$

if $P$ is absolutely continuous with respect to $Q$ and $I(P \| Q) = \infty$ if not, where
$0 \log(0) = 0$ and $\log(0/0) = 0$. $I(P \| Q)$ goes by many names in the literature, including
"Kullback-Leibler information" and "cross entropy." Viewing $I(P \| Q)$ as information
leads to the concept of how much information for statistical inference would
be gained by replacing a confidence posterior $P'' \in \ddot{\mathcal P}$ with another posterior $Q \in \mathcal P$
if the plausible posterior $P' \in \dot{\mathcal P}$ were the physical distribution of the parameter $\theta$.
Specifically,

$$I(P' \| P'' \rightsquigarrow Q) = I(P' \| P'') - I(P' \| Q),$$

as a special case of "information gain" (Pfaffelhuber, 1977), is called the inferential
gain of $Q$ relative to $P''$ given $P'$ (Bickel, 2011a). (The $\rightsquigarrow$ notation is borrowed from
Topsøe (2007).)
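For distributions on a finite parameter space, the divergence and the inferential gain reduce to short sums. The sketch below is illustrative only; the function names are hypothetical, and the divergence is returned in nats.

```python
import numpy as np

def divergence(p, q):
    """Information divergence I(P||Q) of discrete distributions, in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return np.inf              # P puts mass where Q has none
    mask = p > 0                   # convention: 0 * log(0) = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def inferential_gain(p_true, p_bench, q):
    """Gain from replacing the benchmark P'' with Q when P' is the truth:
    I(P'||P'') - I(P'||Q)."""
    return divergence(p_true, p_bench) - divergence(p_true, q)

# Hypothetical distributions of a binary focus parameter.
p_true = [0.3, 0.7]
p_bench = [0.05, 0.95]
```

Replacing the benchmark with the true distribution yields the largest possible gain, namely the divergence of the truth from the benchmark, while keeping the benchmark yields a gain of zero.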

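In analogy with equation (2), the caution $\kappa \in [0, 1]$ is then the extent to which a
"worst-case" plausible posterior $P' \in \dot{\mathcal P}$ is used for inference as opposed to the working
Bayesian posterior $\dot P$ in this definition of the $\kappa$-inferential gain of $Q$ relative to $P''$
given $P'$ and $\dot P$:

$$I(P', \dot P \| P'' \rightsquigarrow Q; \kappa) = I\left( \kappa P' + (1 - \kappa) \dot P \,\|\, P'' \rightsquigarrow Q \right). \tag{6}$$

The posterior distribution that has the highest $\kappa$-inferential gain in the following sense
will be used for making inferences and decisions. The moderate posterior distribution
with caution $\kappa$ relative to $\ddot{\mathcal P}$ given $\dot{\mathcal P}$ and $\dot P$ is denoted by $\widehat P_\kappa$ and defined by

$$\inf_{P'' \in \ddot{\mathcal P}} \inf_{P' \in \dot{\mathcal P}} I(P', \dot P \| P'' \rightsquigarrow \widehat P_\kappa; \kappa) = \inf_{P'' \in \ddot{\mathcal P}} \sup_{Q \in \mathcal P} \inf_{P' \in \dot{\mathcal P}} I(P', \dot P \| P'' \rightsquigarrow Q; \kappa). \tag{7}$$

Less technically, $\widehat P_\kappa$ is the posterior distribution that maximizes the worst-case inferential
gain relative to the confidence posterior $P''$, which is in turn chosen to minimize
the maximum worst-case gain. In the case that equation (7) does not have a unique
solution, the moderate posterior is defined to be as close as possible to the working
Bayesian posterior:

$$\widehat P_\kappa = \arg\inf_{P''' \in \widehat{\mathcal P}_\kappa} I(\dot P \| P'''), \tag{8}$$

where the set $\widehat{\mathcal P}_\kappa$ of candidate moderate posteriors is defined as the set of all distributions
in $\mathcal P$ such that every member of $\widehat{\mathcal P}_\kappa$ solves equation (7). By letting

$$J(\dot P \| P'' \rightsquigarrow Q; \kappa) = \inf_{P' \in \dot{\mathcal P}} I(P', \dot P \| P'' \rightsquigarrow Q; \kappa) \tag{9}$$

for any $P'' \in \ddot{\mathcal P}$, that set may be written as

$$\widehat{\mathcal P}_\kappa = \left\{ P \in \mathcal P : \inf_{P'' \in \ddot{\mathcal P}} J(\dot P \| P'' \rightsquigarrow P; \kappa) = \inf_{P'' \in \ddot{\mathcal P}} \sup_{Q \in \mathcal P} J(\dot P \| P'' \rightsquigarrow Q; \kappa) \right\}. \tag{10}$$

The moderate posterior action with caution $\kappa$ is

$$\widehat a_\kappa = \arg\inf_{a \in \mathcal A} \int L(\theta, a)\, d\widehat P_\kappa(\theta), \tag{11}$$

which defines making decisions on the basis of the moderate posterior as taking actions
that minimize its expected loss. For example, if $\ddot P$ is the only confidence posterior
under consideration, then $\ddot{\mathcal P} = \{\ddot P\}$ and

$$\widehat P_\kappa = \arg\sup_{Q \in \mathcal P} \inf_{P \in \dot{\mathcal P}_\kappa} I(P \| \ddot P \rightsquigarrow Q); \tag{12}$$

$$\dot{\mathcal P}_\kappa = \left\{ \kappa P' + (1 - \kappa) \dot P : P' \in \dot{\mathcal P} \right\}, \tag{13}$$

which recalls equation (2). Since $\dot P \in \dot{\mathcal P}_\kappa \subseteq \dot{\mathcal P}$, $\dot{\mathcal P}_0 = \{\dot P\}$, and $\dot{\mathcal P}_1 = \dot{\mathcal P}$, the effect
of $\kappa < 1$ as opposed to $\kappa = 1$ is to replace the knowledge base $\dot{\mathcal P}$ with a subset $\dot{\mathcal P}_\kappa$
containing the working Bayesian posterior $\dot P$ (cf. Gajdos et al., 2004).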



The two extreme cases of caution reduce decision making to previous frameworks.
A complete lack of caution ($\kappa = 0$) leads to the sole use of the working Bayesian
posterior for the minimization of posterior expected loss: $\widehat P_0 = \dot P$. On the other
hand, complete caution ($\kappa = 1$) leads to ignoring the working Bayesian posterior and,
in the case of a single confidence posterior, to the framework of Bickel (2011a), in
which $\widehat P_1$ is called the blended posterior.



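3.2 Minimum information divergence

The following fact is from Topsøe (1979); see also Pfaffelhuber (1977), Harremoës
(2007), Bickel (2011a), and especially Topsøe (2007).

Lemma 1. Given a distribution $\ddot P$ on $(\Theta, \mathcal H)$, if $\dot{\mathcal P}$ is convex and $I(P \| \ddot P) < \infty$ for
all $P \in \dot{\mathcal P}$, then

$$\sup_{Q \in \mathcal P} \inf_{P' \in \dot{\mathcal P}} I(P' \| \ddot P \rightsquigarrow Q) = \inf_{Q \in \dot{\mathcal P}} I(Q \| \ddot P).$$

The resulting theorem is useful for finding $\widehat P_\kappa$:

Theorem 1. For any $\kappa \in [0, 1]$, if $\dot{\mathcal P}$ is convex and $I(P' \| P'') < \infty$ for all $P' \in \dot{\mathcal P}$
and $P'' \in \ddot{\mathcal P}$, then the moderate posterior $\widehat P_\kappa$ is given by equations (8) and (13) with

$$\widehat{\mathcal P}_\kappa = \left\{ P \in \dot{\mathcal P}_\kappa : \inf_{P'' \in \ddot{\mathcal P}} I(P \| P'') = \inf_{P'' \in \ddot{\mathcal P}} \inf_{Q \in \dot{\mathcal P}_\kappa} I(Q \| P'') \right\}. \tag{14}$$

Proof. For any $\kappa \in [0, 1]$ and $P'' \in \ddot{\mathcal P}$, equations (9) and (12), with Lemma 1, imply

$$\sup_{Q \in \mathcal P} J(\dot P \| P'' \rightsquigarrow Q; \kappa) = \sup_{Q \in \mathcal P} \inf_{P \in \dot{\mathcal P}_\kappa} I(P \| P'' \rightsquigarrow Q) = \inf_{Q \in \dot{\mathcal P}_\kappa} I(Q \| P''),$$

and thus

$$\inf_{P'' \in \ddot{\mathcal P}} \sup_{Q \in \mathcal P} J(\dot P \| P'' \rightsquigarrow Q; \kappa) = \inf_{P'' \in \ddot{\mathcal P}} \inf_{Q \in \dot{\mathcal P}_\kappa} I(Q \| P'').$$

Equation (10) thereby reduces to equation (14). □

An immediate consequence is

Corollary 1. Given $\ddot{\mathcal P} = \{\ddot P\}$, if $\dot{\mathcal P}$ is convex and $I(P' \| \ddot P) < \infty$ for all $P' \in \dot{\mathcal P}$,
then the moderate posterior is

$$\widehat P_\kappa = \arg\inf_{Q \in \dot{\mathcal P}_\kappa} I(Q \| \ddot P). \tag{15}$$

Inference is also simplified when at least one of the confidence posteriors is sufficiently
close to the working Bayesian posterior:

Corollary 2. If $\ddot{\mathcal P} \cap \dot{\mathcal P}_\kappa$ is nonempty, $\dot{\mathcal P}$ is convex, and $I(P' \| P'') < \infty$ for all $P' \in \dot{\mathcal P}$
and $P'' \in \ddot{\mathcal P}$, then the moderate posterior is the confidence posterior that is closest to
the working Bayesian posterior in the sense that

$$\widehat P_\kappa = \arg\inf_{P'' \in \ddot{\mathcal P} \cap \dot{\mathcal P}_\kappa} I(\dot P \| P''). \tag{16}$$

Proof. For any $P'' \in \ddot{\mathcal P} \cap \dot{\mathcal P}_\kappa$, equation (5) implies both $\inf_{Q \in \dot{\mathcal P}_\kappa} I(Q \| P'') = I(P'' \| P'') = 0$
and $I(P \| P'') > 0$ for all $P \in \mathcal P \setminus (\ddot{\mathcal P} \cap \dot{\mathcal P}_\kappa)$ (e.g., Cover and Thomas, 2006). Thus,
by Theorem 1,

$$\widehat{\mathcal P}_\kappa = \left\{ P \in \dot{\mathcal P}_\kappa : I(P \| P'') = 0,\ P'' \in \ddot{\mathcal P} \cap \dot{\mathcal P}_\kappa \right\} = \ddot{\mathcal P} \cap \dot{\mathcal P}_\kappa,$$

which was assumed to be nonempty. Equation (16) then follows from equation (8). □

Remark 1. Unless $\kappa = 0$, the condition that $\ddot{\mathcal P} \cap \dot{\mathcal P}_\kappa$ be nonempty holds whenever the
plausible posteriors are sufficiently unrestricted. The most important such setting for
applications is a complete lack of constraints ($\dot{\mathcal P} = \mathcal P$), in which case

$$\ddot{\mathcal P} \cap \dot{\mathcal P}_\kappa = \ddot{\mathcal P} \cap \left\{ \kappa P + (1 - \kappa) \dot P : P \in \mathcal P \right\}$$

and, if $\mathcal P$ is convex and unbounded, then $\ddot{\mathcal P} \cap \dot{\mathcal P}_\kappa = \ddot{\mathcal P} \cap \mathcal P = \ddot{\mathcal P}$ for any $\kappa \in (0, 1]$.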

4 Examples 

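The first two examples involve the continuous, scalar parameters typical of point
and interval estimation ($\Theta = \mathbb R$). For simplicity, each uses only a single confidence
posterior ($\ddot{\mathcal P} = \{\ddot P\}$).

Example 1. $P_{\theta, 1}$ is the normal distribution of mean $\theta$ and variance 1, i.e., $X \sim N(\theta, 1)$,
and $X = x$ is observed. Further, $\theta \sim N(\mu, \sigma^2)$ with unknown $\mu$ and $\sigma$ of
known lower and upper bounds: $\mu \in [\underline\mu, \overline\mu]$; $\sigma \in [\underline\sigma, \overline\sigma]$. For generality, the bounds are
extended real numbers: $\underline\mu \in \{-\infty\} \cup \mathbb R$, $\underline\sigma \in [0, \infty)$, $\overline\mu \in \mathbb R \cup \{\infty\}$, $\overline\sigma \in [0, \infty) \cup \{\infty\}$.
The intervals $[\underline\mu, \overline\mu]$ and $[\underline\sigma, \overline\sigma]$ are open only as required to ensure that $\mu, \log\sigma \in \mathbb R$
in the presence of infinite bounds or $\underline\sigma = 0$, e.g., $\underline\mu = 0, \overline\mu = \infty \implies [\underline\mu, \overline\mu] = [0, \infty)$.

The working prior is $N(\dot\mu, \dot\sigma^2)$ for some given $\dot\mu \in [\underline\mu, \overline\mu]$ and $\dot\sigma \in [\underline\sigma, \overline\sigma]$. By Bayes's
theorem (e.g., Carlin and Louis, 2009),

$$\dot P = N\left( \frac{\dot\mu + \dot\sigma^2 x}{1 + \dot\sigma^2}, \frac{\dot\sigma^2}{1 + \dot\sigma^2} \right);$$

$$\dot{\mathcal P} = \left\{ N\left( \frac{\mu + \sigma^2 x}{1 + \sigma^2}, \frac{\sigma^2}{1 + \sigma^2} \right) : \mu \in [\underline\mu, \overline\mu],\ \sigma \in [\underline\sigma, \overline\sigma] \right\}.$$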



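By contrast, $\ddot P$ is $N(x, 1)$, not depending on any prior. This $\ddot P$ is a genuine confidence
posterior (§2.3), as can be verified from the fact that $\ddot P(\vartheta < \theta) = P_{\theta, 1}(X > x)$ for
all $\theta \in \mathbb R$ and $x \in \mathbb R$. The performance of any estimator $\hat\theta$ of $\theta$ may be quantified
by its squared-error prediction loss: $L(\theta, \hat\theta) = (\hat\theta - \theta)^2$. By equation (1), the κCΓ
estimate is

$$\dot a_\kappa = \arg\inf_{\hat\theta \in \mathbb R} \left[ \kappa \sup_{P' \in \dot{\mathcal P}} \int (\hat\theta - \theta)^2\, dP'(\theta) + (1 - \kappa) \int (\hat\theta - \theta)^2\, d\dot P(\theta) \right]$$

$$= \arg\inf_{\hat\theta \in \mathbb R} \left( \kappa \int (\hat\theta - \theta)^2\, dP'_{\mu(\hat\theta), \overline\sigma}(\theta) + (1 - \kappa) \int (\hat\theta - \theta)^2\, d\dot P(\theta) \right),$$

where $\mu(\hat\theta) = \underline\mu$ if $\hat\theta - \underline\mu > \overline\mu - \hat\theta$ and $\mu(\hat\theta) = \overline\mu$ otherwise; $P'_{\mu(\hat\theta), \overline\sigma} = N(\mu(\hat\theta), \overline\sigma^2)$.
If $\overline\sigma = 0$, $P'_{\mu(\hat\theta), 0}$ is the Dirac measure at $\mu(\hat\theta)$, implying that

$$\dot a_\kappa = \arg\inf_{\hat\theta \in \mathbb R} \left( \kappa \left( \hat\theta - \mu(\hat\theta) \right)^2 + (1 - \kappa) \int (\hat\theta - \theta)^2\, d\dot P(\theta) \right),$$

which only has a solution if $\underline\mu > -\infty$ and $\overline\mu < \infty$. Those restrictions are not needed
for the estimate based on the moderate posterior. Since

$$\dot{\mathcal P}_\kappa = \left\{ \kappa N\left( \frac{\mu + \sigma^2 x}{1 + \sigma^2}, \frac{\sigma^2}{1 + \sigma^2} \right) + (1 - \kappa) \dot P : \mu \in [\underline\mu, \overline\mu],\ \sigma \in [\underline\sigma, \overline\sigma] \right\}, \tag{17}$$

Corollary 1 entails

$$\widehat P_\kappa = \arg\inf_{\mu \in [\underline\mu, \overline\mu],\, \sigma \in [\underline\sigma, \overline\sigma]} I\left( N\left( \frac{\mu + \sigma^2 x}{1 + \sigma^2}, \frac{\sigma^2}{1 + \sigma^2} \right) \,\Big\|\, N(x, 1) \right)$$

$$= \arg\inf_{\mu \in [\underline\mu, \overline\mu],\, \sigma \in [\underline\sigma, \overline\sigma]} \left[ \log\frac{1 + \sigma^2}{\sigma^2} + \frac{\sigma^2}{1 + \sigma^2} + \left( \frac{\mu - x}{1 + \sigma^2} \right)^2 \right],$$

with the second equality from, e.g., Kullback (1968, p. 189). Substituting $\widehat P_\kappa$ into
equation (11) gives the moderate-posterior mean as the estimate of $\theta$:

$$\widehat a_\kappa = \arg\inf_{\hat\theta \in \mathbb R} \int (\hat\theta - \theta)^2\, d\widehat P_\kappa(\theta) = \int \theta\, d\widehat P_\kappa(\theta),$$

which is unique even if $\underline\mu = -\infty$, $\overline\mu = \infty$, $\underline\sigma = 0$, and $\overline\sigma = \infty$.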
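The normal-normal update and the divergence from N(x, 1) in Example 1 can be sketched numerically. The code below is a rough illustration restricted to complete caution, where the candidate set is the set of plausible posteriors itself; for intermediate caution the candidates are mixtures with the working posterior, whose divergence has no closed form and is not handled here. The grids of hyperparameters are hypothetical.

```python
import math

def posterior_params(mu, sigma, x):
    """Posterior mean and variance of theta given X = x,
    for X | theta ~ N(theta, 1) and theta ~ N(mu, sigma**2)."""
    s2 = sigma ** 2
    return (mu + s2 * x) / (1 + s2), s2 / (1 + s2)

def kl_normal(m, v, m0, v0):
    """I(N(m, v) || N(m0, v0)) in nats, for variances v, v0 > 0."""
    return 0.5 * (math.log(v0 / v) + v / v0 + (m - m0) ** 2 / v0 - 1.0)

def closest_plausible_posterior(x, mus, sigmas):
    """Grid search for the plausible posterior nearest to N(x, 1).
    Returns (divergence, mu, sigma) for the best hyperparameters."""
    return min((kl_normal(*posterior_params(mu, s, x), x, 1.0), mu, s)
               for mu in mus for s in sigmas)

# Hypothetical observation and hyperparameter grids (sigma > 0 only).
d, mu_best, sigma_best = closest_plausible_posterior(2.0, [0, 1, 2, 3], [0.5, 1, 2, 10])
```

On this grid the nearest plausible posterior has prior mean equal to the observation and the most diffuse prior, so that its posterior is close to N(x, 1).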

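The next example drops the parametric assumptions about the plausible prior
distributions.

Example 2. $X \sim N(\theta, 1)$ with no information about $\theta$ except that $\theta \in \mathbb R = \Theta$, that
$X = x$ is observed, and that $\dot P$ is the working Bayesian posterior distribution of $\theta$.
It follows that $\dot{\mathcal P}$ is the set of all distributions on the Borel space $(\mathbb R, \mathcal B(\mathbb R))$. Again
under quadratic loss, by equation (1), the κCΓ estimate is

$$\dot a_\kappa = \arg\inf_{\hat\theta \in \mathbb R} \left( \kappa \sup_{P' \in \dot{\mathcal P}} \int (\hat\theta - \theta)^2\, dP'(\theta) + (1 - \kappa) \int (\hat\theta - \theta)^2\, d\dot P(\theta) \right),$$

which is the posterior mean $\int \theta\, d\dot P(\theta)$ if $\kappa = 0$ but which has no unique value for
any other value of $\kappa$ since $\sup_{P' \in \dot{\mathcal P}} \int (\hat\theta - \theta)^2\, dP'(\theta) = \infty$ for any $\hat\theta$. By contrast,
equation (11) specifies the unique moderate-posterior estimate given $\ddot P = N(x, 1)$:

$$\widehat a_\kappa = \arg\inf_{\hat\theta \in \mathbb R} \int (\hat\theta - \theta)^2\, d\widehat P_\kappa(\theta) = \int \theta\, d\widehat P_\kappa(\theta),$$

where, provided that $\kappa > 0$, $\widehat P_\kappa = \ddot P$ according to Corollary 2 since $\ddot P \in \dot{\mathcal P} = \dot{\mathcal P}_\kappa$,
leading to

$$\widehat a_\kappa = \int \theta\, d\ddot P(\theta),$$

the frequentist posterior mean.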
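Two facts used in Example 2 can be illustrated by simulation: the mean of N(x, 1) minimizes the expected squared-error loss, and the worst-case expected loss over an unrestricted set of distributions is unbounded. The sketch below is a Monte Carlo illustration only; the observation x, the seed, the sample size, and the grid are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.3                                            # hypothetical observation
draws = rng.normal(loc=x, scale=1.0, size=50_000)  # Monte Carlo sample from N(x, 1)

def expected_sq_loss(estimate, sample):
    """Monte Carlo estimate of the expected squared-error loss."""
    return float(np.mean((estimate - sample) ** 2))

# Grid minimizer of the expected loss under N(x, 1): it lands next to the mean.
grid = np.linspace(x - 2.0, x + 2.0, 201)
best = float(grid[np.argmin([expected_sq_loss(g, draws) for g in grid])])

# Expected loss of one fixed estimate (here 0) under point masses at t, which
# belong to the unrestricted set of distributions: it grows without bound in t.
worst_case = [(0.0 - t) ** 2 for t in (1e1, 1e3, 1e6)]
```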

The last example involves a discrete focus parameter, as is typical of hypothesis
testing and model selection applications.

Example 3. Consider the indicator parameter $\theta$ defined such that $\theta = 0$ if the
null hypothesis about $\theta_{**}$ is true ($\theta_{**} = 0$) and $\theta = 1$ if the alternative hypothesis
about $\theta_{**}$ is true ($\theta_{**} \neq 0$). Equivalently, in terms of $\theta_* = |\theta_{**}|$, $\theta = 0$ if $\theta_* = 0$
and $\theta = 1$ if $\theta_* > 0$. If $\dot P$ is a working Bayesian posterior for $\theta$, then $\dot P(\vartheta = 0)$
is the corresponding working Bayesian posterior probability that the null hypothesis
is true. Let $p^{(1)}$ and $p^{(2)}$ denote observed p-values of the one-sided test of $\theta_* = 0$
versus $\theta_* > 0$ and thus of the two-sided test of $\theta_{**} = 0$ versus $\theta_{**} \neq 0$. In this
example, $p^{(1)}(x) < p^{(2)}(x)$, perhaps because $p^{(2)}(x)$ is based on a test that makes
weaker parametric assumptions than that of $p^{(1)}(x)$. For $i = 1, 2$, let $\ddot P_*^{(i)}$ denote the
confidence posterior for $\theta_*$ defined given some $x \in \mathcal X$ such that

$$\ddot P_*^{(i)}(\vartheta_* < \theta_*) = P_{\theta_*, \lambda_*}\left( p^{(i)}(X) < p^{(i)}(x) \right)$$

for all $\theta_* \in \Theta_*$ and $\lambda_* \in \Lambda_*$, where the dependence of $\ddot P_*^{(i)}$ on $x$ is suppressed, as in
Section 2.3. Since $p^{(i)}(X) \sim U(0, 1)$ under the null hypothesis that $\theta_* = 0$, it follows
that

$$\ddot P_*^{(i)}(\vartheta_* \le 0) = \ddot P_*^{(i)}(\vartheta_* = 0) = P_{0, \lambda_*}\left( p^{(i)}(X) < p^{(i)}(x) \right) = p^{(i)}(x), \tag{18}$$

i.e., the confidence posterior probability of the null hypothesis is equal to the p-value
(Bickel, 2011d,a; cf. van Berkum et al., 1996). With $\theta = 0$ if $\theta_* = 0$ and
$\theta = 1$ if $\theta_* > 0$, equation (18) yields $\ddot P^{(i)}(\vartheta = 0) = p^{(i)}(x)$. From widely applicable
conditions for two-sided hypothesis testing (Sellke et al., 2001; Bickel, 2011a) and
with some $\underline P_0^{\text{prior}} \in (0, 1)$ given as the lower bound of the prior probabilities of the
null hypothesis and the restriction that no such probability is 1, the knowledge base
is

$$\dot{\mathcal P} = \left\{ P' \in \mathcal P : \underline P(\vartheta = 0) = \underline P(\{0\}) \le P'(\{0\}) < 1 \right\},$$

the set of plausible posteriors, the distributions on $(\{0, 1\}, 2^{\{0, 1\}})$ with



Pa 



P 9 = 



-prior 

1 ~ £-0 



PJT°V 2) (x) log [l/p (2) (x)]. 



a pr r 



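as the lower bound of the plausible posterior probability of the null hypothesis, where
$\vartheta \sim \underline P$. That lower bound is the greater of the two lower bounds found by separately
applying the methodology of Sellke et al. (2001) to $p^{(1)}(x)$ and $p^{(2)}(x)$. (The binary
operator $\wedge$ in the above equation means "the minimum of," and $\vee$ will similarly stand
for "the maximum of.") Since Theorem 1 applies, the moderate posterior $\widehat P_\kappa$ is given
by equation (8) with

$$\widehat{\mathcal P}_\kappa = \left\{ P \in \dot{\mathcal P}_\kappa : I(P \| \ddot P^{(1)}) \wedge I(P \| \ddot P^{(2)}) = I(\widehat P_\kappa^{(1)} \| \ddot P^{(1)}) \wedge I(\widehat P_\kappa^{(2)} \| \ddot P^{(2)}) \right\}$$

$$= \left\{ \widehat P_\kappa^{(i)} : I(\widehat P_\kappa^{(i)} \| \ddot P^{(i)}) = I(\widehat P_\kappa^{(1)} \| \ddot P^{(1)}) \wedge I(\widehat P_\kappa^{(2)} \| \ddot P^{(2)}),\ i \in \{1, 2\} \right\},$$

where $\widehat P_\kappa^{(i)} = \arg\inf_{Q \in \dot{\mathcal P}_\kappa} I(Q \| \ddot P^{(i)})$; $\dot{\mathcal P}_\kappa = \{\kappa P' + (1 - \kappa) \dot P : P' \in \mathcal P,\ \underline P_0 \le P'(\vartheta = 0) < 1\}$.
More simply,

$$\widehat P_\kappa = \begin{cases}
\widehat P_\kappa^{(1)} & \text{if } I(\widehat P_\kappa^{(1)} \| \ddot P^{(1)}) < I(\widehat P_\kappa^{(2)} \| \ddot P^{(2)}) \\
\widehat P_\kappa^{(2)} & \text{if } I(\widehat P_\kappa^{(1)} \| \ddot P^{(1)}) > I(\widehat P_\kappa^{(2)} \| \ddot P^{(2)}) \\
\widehat P_\kappa^{(1)} & \text{if } I(\widehat P_\kappa^{(1)} \| \ddot P^{(1)}) = I(\widehat P_\kappa^{(2)} \| \ddot P^{(2)}) \text{ and } I(\dot P \| \widehat P_\kappa^{(1)}) < I(\dot P \| \widehat P_\kappa^{(2)}) \\
\widehat P_\kappa^{(2)} & \text{if } I(\widehat P_\kappa^{(1)} \| \ddot P^{(1)}) = I(\widehat P_\kappa^{(2)} \| \ddot P^{(2)}) \text{ and } I(\dot P \| \widehat P_\kappa^{(1)}) > I(\dot P \| \widehat P_\kappa^{(2)}).
\end{cases}$$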

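Letting $\dot P_0 = \dot P(\vartheta = 0)$ and letting $\hat\vartheta$ denote the focus parameter according to the
moderate posterior ($\hat\vartheta \sim \widehat P_\kappa$),

$$\widehat P_\kappa^{(i)}(\hat\vartheta = 0) = \arg\inf_{Q_0} \sum_{j=0}^{1} Q(\vartheta = j) \log\frac{Q(\vartheta = j)}{\ddot P^{(i)}(\vartheta = j)}$$

$$= \arg\inf_{Q_0 \in \{\kappa P_0 + (1 - \kappa) \dot P_0 \,:\, P_0 \in [\underline P_0, 1)\}} \left( Q_0 \log\frac{Q_0}{p^{(i)}(x)} + (1 - Q_0) \log\frac{1 - Q_0}{1 - p^{(i)}(x)} \right)$$

$$= \arg\inf_{Q_0 \in [\kappa \underline P_0 + (1 - \kappa) \dot P_0,\ \kappa + (1 - \kappa) \dot P_0)} \left( Q_0 \log\frac{Q_0}{p^{(i)}(x)} + (1 - Q_0) \log\frac{1 - Q_0}{1 - p^{(i)}(x)} \right)$$

$$= \begin{cases}
\kappa \underline P_0 + (1 - \kappa) \dot P_0 & \text{if } p^{(i)}(x) < \kappa \underline P_0 + (1 - \kappa) \dot P_0 \\
p^{(i)}(x) & \text{if } \kappa \underline P_0 + (1 - \kappa) \dot P_0 \le p^{(i)}(x) < \kappa + (1 - \kappa) \dot P_0 \\
\kappa + (1 - \kappa) \dot P_0 & \text{if } p^{(i)}(x) \ge \kappa + (1 - \kappa) \dot P_0.
\end{cases}$$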
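The case analysis above amounts to clamping the p-value to the interval of null probabilities allowed by the caution, because the divergence between two Bernoulli distributions is convex in its first argument and vanishes when the two agree. A minimal sketch follows; the function names and numerical inputs are hypothetical, and the upper endpoint is treated as attainable although the interval in the text is open there.

```python
import math

def bernoulli_divergence(q, p):
    """I(Bernoulli(q) || Bernoulli(p)) in nats, with 0 * log(0) = 0.
    Requires 0 < p < 1."""
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total

def moderate_null_probability(p_value, kappa, lower, working):
    """Clamp the p-value to
    [kappa * lower + (1 - kappa) * working, kappa + (1 - kappa) * working],
    where `lower` bounds the plausible posterior null probabilities and
    `working` is the working Bayesian posterior null probability."""
    lo = kappa * lower + (1 - kappa) * working
    hi = kappa + (1 - kappa) * working
    return min(max(p_value, lo), hi)

# Hypothetical inputs: p-value 0.03, caution 0.5, lower bound 0.2, working 0.6.
example = moderate_null_probability(0.03, 0.5, 0.2, 0.6)
```

With no caution the interval collapses to the working posterior probability, so the p-value has no influence; with more caution the interval widens and the p-value is used whenever it falls inside.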



Since P K G P K 



I (0 = o) 



| K £ + (!-«) P } if p« (x) < pW (x) < kP^ + (!-«) P 



(p( 2 ) (x)} 



if p« (x) < «P + (1 - «) P < p( 2 ) (x) < K + (1 - K 



G < 



{p« (x) , p< 2 ) (x) } if «P + (1 - k) P < p (1) (x) < p< 2 ) (x) < K + (1 - K 



{p« (x)} 
« + (l - k)P } 



if KP d + (1 - «) P < p« (x) < K + (1 - K) P < 
if p( 2 ) (x) > p« (x) > « + (1 - «) P , 

(20) 



from which the extreme condition p^ (x) < «P + (1 — n) P < «; + (1 — n) P$ < 
p^ (x) is omitted for brevity. In the case of no caution, the working Bayesian posterior 
probability is recovered: Po (0 = 0) =P(0 = 0), 

which does not depend on p(x). 
More interestingly, the case of complete caution leads to 



Pi 



G < 



{t (It = 0) } if p (1) (x) < (x) < P 

(p( 2 )(x)} if P W (X) < P < p& (X) 

{p« (x) , p( 2 ) (x) } if P < pW (x) < p( 2 ) (x) , 



(21) 
which has no dependence on $\tilde{P}(\vartheta = 0)$. The simplifying effect of considering only a
single p-value is evident from using $p^{(1)}(x) = p^{(2)}(x)$ in formulas (20) and (21).
For example, expression (21) results in a unique $P_1(\vartheta = 0)$ equal to the blended
posterior probability of Bickel (2011a). When formulas (20) and (21) say no more
than $P_\kappa(\vartheta = 0) \in \{p^{(1)}(x), p^{(2)}(x)\}$, the equation defining the moderate posterior
ensures the uniqueness of the moderate posterior probability by equating it with the
p-value closest to the working Bayesian posterior probability:

\[
P_\kappa(\vartheta = 0) = \arg\inf_{p \in \{p^{(1)}(x),\, p^{(2)}(x)\}}
\left( \tilde{P}(\vartheta = 0) \log \frac{\tilde{P}(\vartheta = 0)}{p}
     + \tilde{P}(\vartheta = 1) \log \frac{\tilde{P}(\vartheta = 1)}{1 - p} \right),
\]

which is a special case of Corollary 5. In this way, the caution parameter, the working
Bayesian posterior, and the constraints on the plausible posteriors together overcome
the dilemma of whether to use the more conservative p-value or the less conservative
p-value.
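
The case analysis in (20) and (21) and the tie-breaking rule can be checked numerically. The following Python sketch is only an illustration, not the author's implementation. It assumes that the candidate values are the joint minimizers of the relative entropy over the admissible interval, which runs from kappa times the lower bound plus (1 - kappa) times the working probability up to kappa plus (1 - kappa) times the working probability, and over the two p-values; that assumption reproduces the five listed cases and applies the same rule to the omitted extreme case. The function names and the numerical inputs are hypothetical.

```python
import math


def kl_bernoulli(q, p):
    """Relative entropy I(Q||P) between Bernoulli(q) and Bernoulli(p)."""
    def term(a, b):
        if a == 0.0:
            return 0.0          # 0 * log(0 / b) is taken as 0
        if b == 0.0:
            return math.inf     # positive mass where the reference has none
        return a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)


def moderate_candidates(p1, p2, lower, working, kappa):
    """Candidate values of the moderate posterior probability of the null."""
    lo = kappa * lower + (1.0 - kappa) * working   # smallest admissible value
    hi = kappa + (1.0 - kappa) * working           # largest admissible value
    pvalues = (p1, p2)
    # The divergence is convex in q with its minimum at p, so the constrained
    # minimizer for each p-value is that p-value clipped to [lo, hi].
    clipped = [min(max(p, lo), hi) for p in pvalues]
    divergences = [kl_bernoulli(q, p) for q, p in zip(clipped, pvalues)]
    best = min(divergences)
    keep = {q for q, d in zip(clipped, divergences)
            if math.isclose(d, best, rel_tol=0.0, abs_tol=1e-12)}
    return sorted(keep)


def closest_to_working(candidates, working):
    """Break ties by the smallest divergence from the working posterior."""
    return min(candidates, key=lambda p: kl_bernoulli(working, p))


# Hypothetical inputs: two p-values, a lower bound, a working probability.
p1, p2, lower, working = 0.03, 0.08, 0.02, 0.10
for kappa in (0.0, 0.5, 1.0):
    cands = moderate_candidates(p1, p2, lower, working, kappa)
    print(kappa, cands, closest_to_working(cands, working))
```

With these inputs, no caution (kappa = 0) returns the working probability 0.10, whereas complete caution (kappa = 1) leaves both p-values as candidates and the tie-break selects 0.08, the p-value nearer the working probability.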



5 Extending the caution framework 
5.1 Variations of the framework 

The above framework for balancing Bayesian and frequentist approaches to inference 
does not apply to all situations encountered in applications. The working posterior
$\tilde{P}$, used exclusively in the absence of caution, and the benchmark posterior $\ddot{P}$,
over which inference will be improved as much as possible, in equations (12) and (15)
can each be a Bayesian posterior or a confidence posterior. The permutations lead to
four versions of the proposed approach:
1. $\tilde{P}$ is a Bayesian posterior in $\dot{\mathcal{P}}$, and $\ddot{P}$ is a confidence posterior. This version
yields the balance between Bayesian and frequentist inference defined in Section
3 and illustrated in Section 4.

2. $\tilde{P}$ is a confidence posterior, and $\ddot{P}$ is a Bayesian posterior in $\dot{\mathcal{P}}$. The potential
uses of this reversal are unclear since it would paradoxically lead to dependence
on a single Bayesian posterior to the extent of the caution.

3. $\tilde{P} = \ddot{P}$, where $\ddot{P}$ is a Bayesian posterior in $\dot{\mathcal{P}}$. Using the same Bayesian posterior
as both the working posterior and the benchmark posterior is attractive in the
absence of reliable confidence intervals or p-values from which a confidence
posterior could be constructed. Thus, this version extends the scope of the
framework across the domains to which Bayesian methods apply. However, this
version becomes trivial whenever equation (15) holds according to Corollary 1,
for in that case, $P_\kappa = \arg\inf_{Q \in \dot{\mathcal{P}}_\kappa} I(Q \| \ddot{P}) = \ddot{P}$ for all $\kappa \in [0, 1]$ since
$\ddot{P} \in \dot{\mathcal{P}}_\kappa$ necessarily. In other words, the Bayesian posterior would be used for
inference irrespective of the degree of caution and the knowledge base.

4. $\tilde{P} = \ddot{P}$, where $\ddot{P}$ is a confidence posterior. Using the same confidence posterior
as both the working posterior and the benchmark posterior is useful when a set
$\dot{\mathcal{P}}$ of plausible posteriors can be specified but when no member of that set can
be singled out as special. In many cases involving a continuous parameter $\theta$,
no such member can be derived from the knowledge base $\dot{\mathcal{P}}$ without imposing
arbitrary procedures such as averaging over the members with respect to some
measure chosen for convenience. That will be explained in Section 5.2, where
the case of two unequal confidence posteriors will also be considered.

For simplicity, the versions are described as if $\ddot{\mathcal{P}} = \{\ddot{P}\}$, but they also pertain to a set
$\ddot{\mathcal{P}}$ of multiple benchmark posteriors, which define the moderate posterior $P_\kappa$ through
its general defining equation. The three most applicable of those versions are
summarized in Table 3.

Setting                                          Working posterior      Benchmark posterior
Bayes and frequentist approaches apply           Bayes posterior        confidence posterior
no confidence intervals or p-values              Bayes posterior        Bayes posterior
continuous $\theta$ & only a set of Bayes posteriors   confidence posterior   confidence posterior

Table 3: Settings for three versions of the proposed framework.

Generalizing beyond those versions, the working posterior $\tilde{P}$ can be any posterior
distribution that would be used exclusively in the absence of caution, whereas when
there is caution, inferences are made with the goal that they improve upon those that
would be made using any other posterior distribution in $\ddot{\mathcal{P}}$. The application at hand
can help determine which of those distributions is a Bayesian posterior and which
is some other type of distribution such as a confidence distribution. For example,
given a working posterior $\tilde{P}$ from a proper prior, a posterior from an improper prior
could be used as the benchmark posterior $\ddot{P}$ in the absence of a suitable confidence
posterior (cf. Bickel 2011a).




5.2 Inference without a working Bayesian posterior 

The more interesting of the two widely applicable variations of the framework is that
in which both $\tilde{P}$ and $\ddot{P}$ are the same confidence posterior. Thus, some implications of
using a single confidence posterior simultaneously as the working posterior and as the
benchmark posterior ($\tilde{P} = \ddot{P}$) merit noting. First, complete caution ($\kappa = 1$) leads
to ignoring the role of the confidence posterior as a working posterior and thereby
collapses to the blended inferences of Bickel (2011a). Second, when $\kappa < 1$ and the
confidence posterior is not a plausible posterior ($\ddot{P} \notin \dot{\mathcal{P}}$), the moderate posterior
may not be a plausible posterior ($P_\kappa \notin \dot{\mathcal{P}}$). In fact, whenever sufficient conditions
for Corollary 1 are met, $P_\kappa$ will be plausible only if $\ddot{P}$ is plausible. That suggests the
use of $\kappa = 1$ in the absence of a working Bayesian posterior in order to avoid excessive
dependence on $\ddot{P}$ at the expense of $\dot{\mathcal{P}}$, the knowledge base. On the other hand,
allowing $P_\kappa \notin \dot{\mathcal{P}}$ makes $P_\kappa$ less dependent on the precise borders of $\dot{\mathcal{P}}$, and this may
be desirable to the extent that such borders are uncertain or subjectively specified.
Third, when $\ddot{P} \in \dot{\mathcal{P}}$, the moderate posterior is simply equal to the confidence posterior
($P_\kappa = \ddot{P}$) under the sufficient conditions for equation (15) given by Corollary 1. The
following examples illustrate the second and third implications.

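Example 4. (Variation of Example 2) This example is trivial since $\ddot{P} \in \dot{\mathcal{P}}$, $\tilde{P} = \ddot{P}$,
and Corollary 1 entail $P_\kappa = \ddot{P}$.

More generally, $\ddot{P} \in \dot{\mathcal{P}}$ and $\tilde{P} = \ddot{P}$ imply $P_\kappa = \ddot{P}$ by Corollary 2.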

Example 5. (Variation of Example 3) Because $\ddot{P}(\vartheta = 0) = p(x)$, the identity
$\tilde{P} = \ddot{P}$ yields $\tilde{P}_0 = p(x)$, reducing equation (19) to

\[
P_\kappa(\vartheta = 0) =
\begin{cases}
\kappa\underline{P}_0 + (1-\kappa)\,p(x) & \text{if } p(x) < \underline{P}_0 \\
p(x) & \text{if } p(x) \ge \underline{P}_0.
\end{cases}
\]

Re-expressing this as $P_\kappa(\vartheta = 0) = \left(\kappa\underline{P}_0 + (1-\kappa)\,p(x)\right) \vee p(x)$ facilitates
comparison with the equation $P_1(\vartheta = 0) = \underline{P}_0 \vee p(x)$ used in the blended inference
framework (Bickel 2011a). Therefore, $\kappa \in [0, 1)$ entails that $P_\kappa(\vartheta = 0)$ is less than
the lower bound $\underline{P}_0$ whenever $p(x) < \underline{P}_0$. That would be clearly unacceptable if the
lower bound is scientifically established, but if it is instead highly uncertain or
subjectively assessed, then $P_\kappa(\vartheta = 0)$ can bypass $\underline{P}_0$ as warranted.
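
The comparison in Example 5 can be made concrete with a short Python sketch. It is only an illustration under the example's assumption that the working and benchmark posteriors coincide, so that the working probability of the null hypothesis equals the p-value. The function names and numerical inputs are hypothetical.

```python
def moderate_prob(p, lower, kappa):
    """Moderate posterior probability of the null when the working and
    benchmark posteriors are the same confidence posterior."""
    return max(kappa * lower + (1.0 - kappa) * p, p)


def blended_prob(p, lower):
    """Complete-caution (kappa = 1) special case: max(lower bound, p-value)."""
    return max(lower, p)


# Hypothetical inputs: a p-value below the lower bound on the null probability.
p, lower = 0.01, 0.05
for kappa in (0.0, 0.5, 1.0):
    value = moderate_prob(p, lower, kappa)
    print(f"kappa={kappa}: {value:.3f} (below lower bound: {value < lower})")
```

For any caution level below 1 the result falls short of the lower bound, whereas complete caution reproduces the blended value.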



An alternative to the above approach in the absence of a specified $\tilde{P}$ is to apply the
strategy of Section 3 with $\tilde{P}$ as a function of $\dot{\mathcal{P}}$, following Gajdos (2008). Examples
of functions that transform a set of distributions to a single distribution include
the Steiner point (Gajdos 2008), the arithmetic mean ("center of mass"), and the
maximum entropy distribution (Paris 1994). In the continuous-parameter case, such
functions require a base measure for partitioning.
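
For a two-valued parameter, two of the set-to-point functions named above have closed forms, which the following Python sketch illustrates. It assumes, purely for illustration, that the plausible posteriors assign the null hypothesis a probability anywhere in an interval [lo, hi] and that the center of mass is taken with respect to the uniform (Lebesgue) measure on that interval. The function names and inputs are hypothetical.

```python
import math


def entropy(q):
    """Shannon entropy (nats) of a two-point distribution with P(null) = q."""
    return -sum(a * math.log(a) for a in (q, 1.0 - q) if a > 0.0)


def center_of_mass(lo, hi):
    """Arithmetic mean of the interval under the uniform base measure."""
    return 0.5 * (lo + hi)


def max_entropy(lo, hi):
    """Entropy is concave with its peak at 1/2, so clip 1/2 to [lo, hi]."""
    return min(max(0.5, lo), hi)


lo, hi = 0.6, 0.9                      # hypothetical interval of P(null)
print(center_of_mass(lo, hi))          # midpoint of the interval
print(max_entropy(lo, hi))             # endpoint nearest to 1/2
# Grid check that no admissible value has higher entropy than the clipped one.
grid = [lo + (hi - lo) * k / 1000 for k in range(1001)]
assert all(entropy(q) <= entropy(max_entropy(lo, hi)) + 1e-12 for q in grid)
```

The two rules generally select different working posteriors from the same set, and the center of mass changes with the choice of base measure, which is the arbitrariness noted for the continuous-parameter case.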

There is no need to impose an arbitrary base measure if two different confidence
posteriors $\tilde{P}$ and $\ddot{P}$ are under consideration ($\tilde{P} \ne \ddot{P}$). Using them as the working
posterior and the benchmark posterior in equations (12) and (15) would be most
appropriate when $\tilde{P}$ represents a newer or more risky procedure and when $\ddot{P}$
corresponds to a better established or more thoroughly tested procedure. More generally,
the general defining equation of the moderate posterior specifies how to apply a working
confidence posterior $\tilde{P}$ with a set $\ddot{\mathcal{P}}$ of benchmark confidence posteriors.



6 Discussion 



The featured moderate-posterior methodology has been contrasted with the simpler
$\kappa$CG methodology. As Examples 1 and 2 illustrated under quadratic loss, the former
can yield unique actions in a wide variety of settings in which the latter cannot.
Using CG minimaxity ($\kappa = 1$), uniqueness has been achieved under quadratic loss
by restricting the action space to finite bounds (Betrò and Ruggeri 1992) and by
similarly restricting the parameter space (Abdollah Bayati and Parsian 2011). The
moderate-posterior estimators did not require such restrictions.

The main advantage of the moderate-posterior framework is that it provides first 
principles from which a statistician may derive a Bayesian analysis, a frequentist 
analysis, or a combination of the two, depending on the chosen level of caution and 
on the quality of prior information. This allows the caution level to be precisely 
reported with the resulting statistical inferences. In addition, the caution level may 
be determined by the needs of an organization or collaborating scientist rather than 
by the personal attitude of the statistician. 
Various factors may be considered in choosing the level of caution. For example,
more caution with Bayesian inference may be warranted when the confidence pos- 
terior represents a frequentist procedure that has stood the test of time than when 
it represents a new frequentist procedure based on questionable assumptions. The 
caution level could then be interpreted as the pre-data degree of reluctance an agent 
has in modifying the frequentist procedures encoded in the confidence posterior. 

The moderate-posterior framework of Section 3 is general enough to incorporate
conflicting frequentist approaches, as seen in Example 3. For additional generality,
Section 5.2 provides ways to modify the framework for situations in which any
dependence on a subjective or guessed Bayesian posterior would be undesirable.

In other situations, any dependence of inference on the level of caution would be 
undesirable. Provided that there is at least a little caution, the use of a sufficiently 
broad set of plausible posteriors under the unmodified framework (§3) eliminates any
other dependence on the degree of caution (Remark 1).